[SPARK-3094] [PySpark] compatitable with PyPy #2144

davies · 2014-08-26T19:27:33Z

After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:

PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py

The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:

Job	CPython 2.7	PyPy 2.3.1	Speed up
Word Count	41s	15s	2.7x
Sort	46s	44s	1.05x
Stats	174s	3.6s	48x

Here is the code used for benchmark:

rdd = sc.textFile("text")
def wordcount():
    rdd.flatMap(lambda x:x.split('/'))\
        .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
    rdd.sortBy(lambda x:x, 1).count()
def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()

SparkQA · 2014-08-26T19:31:15Z

QA tests have started for PR 2144 at commit 25b4ca7.

This patch merges cleanly.

SparkQA · 2014-08-26T21:31:15Z

Tests timed out after a configured wait of 120m.

davies · 2014-08-26T21:51:30Z

Jenkins, retest this please.

SparkQA · 2014-08-26T21:55:54Z

QA tests have started for PR 2144 at commit 25b4ca7.

This patch merges cleanly.

SparkQA · 2014-08-26T23:55:55Z

Tests timed out after a configured wait of 120m.

mateiz · 2014-08-27T00:37:13Z

This looks like it will be tricky to maintain without automated testing. Can you update dev/run-tests to also run the PySpark tests with PyPy maybe? You might need help from Patrick or others on installing pypy on the Jenkins machines.

SparkQA · 2014-08-28T19:34:16Z

QA tests have started for PR 2144 at commit 9986692.

This patch merges cleanly.

SparkQA · 2014-08-28T21:34:17Z

Tests timed out after a configured wait of 120m.

SparkQA · 2014-08-28T21:59:05Z

QA tests have started for PR 2144 at commit cb2d724.

This patch merges cleanly.

SparkQA · 2014-08-28T22:55:24Z

QA tests have finished for PR 2144 at commit cb2d724.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ItemGetterType(ctypes.Structure):
- class AttrGetterType(ctypes.Structure):

SparkQA · 2014-08-29T22:49:48Z

QA tests have started for PR 2144 at commit 42fb5fa.

This patch merges cleanly.

SparkQA · 2014-08-29T23:46:23Z

QA tests have finished for PR 2144 at commit 42fb5fa.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ItemGetterType(ctypes.Structure):
- class AttrGetterType(ctypes.Structure):
- case class Sqrt(child: Expression) extends UnaryExpression
- class TreeNodeRef(val obj: TreeNode[_])

mateiz · 2014-08-30T21:03:51Z

@davies just curious, do all the unit tests run if you do run-tests with pypy? We should make sure they do, and add a command in there to test this in Jenkins (ask Patrick for help with this).

davies · 2014-08-31T06:31:55Z

Yes, I will do that next week.

JoshRosen · 2014-09-04T03:12:55Z

One concern with adding tests for pypy is that it might significantly increase the runtime of the Jenkins tests. We should test regularly with pypy to make sure that we don't break compatibility, but we might not need to run those tests with every commit.

mateiz · 2014-09-04T03:24:29Z

How long do the Python tests run now?

Anyway, we could do PyPy only if Python code changed (but I'd still do Python all the time).

davies · 2014-09-04T17:26:20Z

PyPy is fully compatible with CPython for pure Python code, so it's not necessary to test against every commit with PyPy.

Maybe we could have nightly tests (for performance or scalability), we could put PyPy in this kind of tests.

davies · 2014-09-04T22:28:37Z

PyPy does not fully support NumPy right now, so MLlib can not run with PyPy.

mateiz · 2014-09-05T00:39:51Z

So you guys should figure out a way to run this so that it doesn't get stale. For example it's fine to add some code to the script that runs all the tests except the MLlib ones. But there's little point merging it unless we also automatically test that it keeps working, otherwise we'll only notice breakage on each release (if we remember to test with PyPy).

mattf · 2014-09-05T01:01:33Z

So you guys should figure out a way to run this so that it doesn't get stale. For example it's fine to add some code to the script that runs all the tests except the MLlib ones. But there's little point merging it unless we also automatically test that it keeps working, otherwise we'll only notice breakage on each release (if we remember to test with PyPy).

+1

JoshRosen · 2014-09-05T01:16:22Z

Let's just have the PyPy tests run by default on Jenkins. If this causes build speed problems later down the road, we can revisit the issue of selectively running tests.

davies · 2014-09-05T17:19:13Z

@mateiz @JoshRosen @mattf run-tests will try to run tests for spark core and sql with PyPy.

One known issue is that serialization of array in PyPy is similar to Python2.6, which is not supported by Pyrite, so one test cases has been skipped for them. I had added another one which do not depend on serialization of array.

Also I had added some refactor in cloudpickle to do it in more portable ways (which is also used by dill).

davies · 2014-09-05T19:18:10Z

Jenkins, test this please.

SparkQA · 2014-09-05T23:53:49Z

QA tests have started for PR 2144 at commit fae8b19.

This patch merges cleanly.

SparkQA · 2014-09-06T01:28:37Z

QA tests have finished for PR 2144 at commit fae8b19.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2014-09-06T23:20:10Z

I'm waiting to figure out the right procedure for installing pypy on our Jenkins boxes; once I have this figured out, I'll loop back to finish reviewing. From what I've seen so far, though, this looks good, although I should take a more careful look over the cloudpickle refactoring to make sure that I understand what's happening.

JoshRosen · 2014-09-11T05:08:04Z

Thanks to @shaneknapp we now have pypy-2.0.2-1.el6.x86_64 on the Jenkins workers, so I'm going to try retesting this.

SparkQA · 2014-09-11T05:10:49Z

QA tests have started for PR 2144 at commit fae8b19.

This patch merges cleanly.

SparkQA · 2014-09-11T06:14:58Z

QA tests have finished for PR 2144 at commit fae8b19.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Dummy(object):

SparkQA · 2014-09-11T17:09:15Z

QA tests have started for PR 2144 at commit b20ab3a.

This patch merges cleanly.

SparkQA · 2014-09-11T17:19:15Z

QA tests have started for PR 2144 at commit 9aed6c5.

This patch merges cleanly.

SparkQA · 2014-09-11T18:15:26Z

QA tests have finished for PR 2144 at commit b20ab3a.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Dummy(object):

SparkQA · 2014-09-11T18:25:51Z

QA tests have finished for PR 2144 at commit 9aed6c5.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Dummy(object):

JoshRosen · 2014-09-13T01:42:29Z

This looks good to me (Davies and I walked through the code offline). I'm going to merge this into master. Thanks!

function.func_code.co_names has all the names used in the function, including name of attributes. It will pickle some unnecessary globals if there is a global having the same name with attribute (in co_names). There is a regression introduced by #2144, revert part of changes in that PR. cc JoshRosen Author: Davies Liu <[email protected]> Closes #2522 from davies/globals and squashes the following commits: dfbccf5 [Davies Liu] fix bug while pickle globals of function

staslos · 2015-07-20T21:00:31Z

Sorry for the stupid question. Does this work on executors in YARN cluster? Or it's just for local mode only?

support PyPy

25b4ca7

Merge branch 'master' into pypy

9986692

fix tests

cb2d724

Merge branch 'master' into pypy

42fb5fa

JoshRosen mentioned this pull request Sep 1, 2014

leverage cloudpickle for missing functionality in dill uqfoundation/dill#65

Closed

davies added 3 commits September 4, 2014 10:30

Merge branch 'master' into pypy

3c1dbfe

serialize itemgetter/attrgetter in portable ways

1b98fb3

fix tests using array with PyPy

f651fd0

cleanup

c8d62ba

try to run tests with PyPy in run-tests

591f830

improve attrgetter, add tests

fae8b19

davies force-pushed the pypy branch from f5f2db5 to fae8b19 Compare September 5, 2014 17:52

davies mentioned this pull request Sep 5, 2014

[SPARK-3415] [PySpark] removes SerializingAdapter code #2287

Closed

davies added 2 commits September 11, 2014 09:56

Merge branch 'master' into pypy

3ca2351

pickle sys.stdout and stderr in portable way

b20ab3a

davies added 2 commits September 11, 2014 10:10

refactor

4bc1f04

use protocol 2 in CloudPickle

9aed6c5

asfgit closed this in 71af030 Sep 13, 2014

davies deleted the pypy branch September 15, 2014 22:19

JoshRosen mentioned this pull request Sep 24, 2014

[SPARK-3679] [PySpark] pickle the exact globals of functions #2522

Closed

[SPARK-3094] [PySpark] compatitable with PyPy #2144

[SPARK-3094] [PySpark] compatitable with PyPy #2144

Uh oh!

Conversation

davies commented Aug 26, 2014

Uh oh!

SparkQA commented Aug 26, 2014

Uh oh!

SparkQA commented Aug 26, 2014

Uh oh!

davies commented Aug 26, 2014

Uh oh!

SparkQA commented Aug 26, 2014

Uh oh!

SparkQA commented Aug 26, 2014

Uh oh!

mateiz commented Aug 27, 2014

Uh oh!

SparkQA commented Aug 28, 2014

Uh oh!

SparkQA commented Aug 28, 2014

Uh oh!

SparkQA commented Aug 28, 2014

Uh oh!

SparkQA commented Aug 28, 2014

Uh oh!

SparkQA commented Aug 29, 2014

Uh oh!

SparkQA commented Aug 29, 2014

Uh oh!

mateiz commented Aug 30, 2014

Uh oh!

davies commented Aug 31, 2014

Uh oh!

JoshRosen commented Sep 4, 2014

Uh oh!

mateiz commented Sep 4, 2014

Uh oh!

davies commented Sep 4, 2014

Uh oh!

davies commented Sep 4, 2014

Uh oh!

mateiz commented Sep 5, 2014

Uh oh!

mattf commented Sep 5, 2014

Uh oh!

JoshRosen commented Sep 5, 2014

Uh oh!

davies commented Sep 5, 2014

Uh oh!

davies commented Sep 5, 2014

Uh oh!

SparkQA commented Sep 5, 2014

Uh oh!

SparkQA commented Sep 6, 2014

Uh oh!

JoshRosen commented Sep 6, 2014

Uh oh!

JoshRosen commented Sep 11, 2014

Uh oh!

SparkQA commented Sep 11, 2014

Uh oh!

SparkQA commented Sep 11, 2014

Uh oh!

SparkQA commented Sep 11, 2014

Uh oh!

SparkQA commented Sep 11, 2014

Uh oh!

SparkQA commented Sep 11, 2014

Uh oh!

SparkQA commented Sep 11, 2014

Uh oh!

JoshRosen commented Sep 13, 2014

Uh oh!

staslos commented Jul 20, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects